1 The problem: predicting credit card fraud

The goal of this project is to predict fraudulent credit card transactions.

The data set we will use consists of credit card transactions and it includes information about each transaction including customer details, the merchant and category of purchase, and whether or not the transaction was a fraud.

1.1 Obtain the data

## Rows: 671,028
## Columns: 14
## $ trans_date_trans_time <dttm> 2019-02-22 07:32:58, 2019-02-16 15:07:20, 2019-…
## $ trans_year            <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2020, …
## $ category              <chr> "entertainment", "kids_pets", "personal_care", "…
## $ amt                   <dbl> 7.79, 3.89, 8.43, 40.00, 54.04, 95.61, 64.95, 3.…
## $ city                  <chr> "Veedersburg", "Holloway", "Arnold", "Apison", "…
## $ state                 <chr> "IN", "OH", "MO", "TN", "CO", "GA", "MN", "AL", …
## $ lat                   <dbl> 40.1186, 40.0113, 38.4305, 35.0149, 39.4584, 32.…
## $ long                  <dbl> -87.2602, -80.9701, -90.3870, -85.0164, -106.385…
## $ city_pop              <dbl> 4049, 128, 35439, 3730, 277, 1841, 136, 190178, …
## $ job                   <chr> "Development worker, community", "Child psychoth…
## $ dob                   <date> 1959-10-19, 1946-04-03, 1985-03-31, 1991-01-28,…
## $ merch_lat             <dbl> 39.41679, 39.74585, 37.73078, 34.53277, 39.95244…
## $ merch_long            <dbl> -87.52619, -81.52477, -91.36875, -84.10676, -106…
## $ is_fraud              <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

We also add some variables to isolate the hour of the day, day of the week, month etc. corresponding to the transactions as well as the age of the customers to assist in exploratory data analysis:

# use lubridate to isolate day of week, month, hour of day, age etc. 
card_fraud <- card_fraud %>% 
  mutate( hour = hour(trans_date_trans_time),
          wday = wday(trans_date_trans_time, label = TRUE),
          month_name = month(trans_date_trans_time, label = TRUE),
          age = interval(dob, trans_date_trans_time) / years(1)
) %>% 
  rename(year = trans_year) %>% 

# use lat, long to calculate distance between transaction and cardholder's home   
  mutate(
    
# convert latitude/longitude to radians
    lat1_radians = lat / 57.29577951,
    lat2_radians = merch_lat / 57.29577951,
    long1_radians = long / 57.29577951,
    long2_radians = merch_long / 57.29577951,
    
# calculate distance in miles
    distance_miles = 3963.0 * acos((sin(lat1_radians) * sin(lat2_radians)) + cos(lat1_radians) * cos(lat2_radians) * cos(long2_radians - long1_radians)),

# calculate distance in km
    distance_km = 6377.830272 * acos((sin(lat1_radians) * sin(lat2_radians)) + cos(lat1_radians) * cos(lat2_radians) * cos(long2_radians - long1_radians))

  )

1.2 Exploratory Data Analysis (EDA)

Let’s explore the data set and understand some useful features of it.

  • First, let’s understand how many transactions were actually fraud in this data set:
# group transactions by year 
card_fraud %>% 
  group_by(year) %>% 
# count number of fraudulent and non- fraudulent transactions 
  count(is_fraud) %>% 
# calculate variable for frequency of fraud 
  mutate(percentage = n/sum(n) *100) 
## # A tibble: 4 × 4
## # Groups:   year [2]
##    year is_fraud      n percentage
##   <dbl> <fct>     <int>      <dbl>
## 1  2019 1          2721      0.568
## 2  2019 0        475925     99.4  
## 3  2020 1          1215      0.632
## 4  2020 0        191167     99.4
  • Approximately 0.6% of transactions were fraudulent in both 2019 and 2020.

  • Next, examine date/time variables: when does fraud occur? Are there some weekdays, months or hours of the day when fraud occurs more frequently? :

# plot bar graph to investigate fraud by day of the week 
card_fraud %>% 
  
# filter only for fraudulent transactions
  filter(is_fraud==1) %>% 
  
# count number of transactions of fraud by weekday 
  group_by(wday) %>% 
  count() %>% 
  
# plot bars for weekdays 
  ggplot(aes(x = wday, y = n)) + 
  geom_bar(stat="identity") +
  
# add some text to show count of transactions for each day 
  geom_text(aes(label = n, y= n - 10), 
            colour = "white", size = 4 , vjust = 2) +
  labs(x='', y='', 
       title = "Number of occurences of fraud by weekday")

# plot bar graph to investigate fraud by month of the year 
card_fraud %>% 
  
# filter only for fraudulent transactions
  filter(is_fraud==1) %>% 
  
# count number of transactions of fraud by month 
  group_by(month_name) %>% 
  count() %>% 
  
# plot bars for months 
  ggplot(aes(x = month_name, y = n)) + 
  geom_bar(stat="identity") +
  
# add some text to show count of transactions for each month 
  geom_text(aes(label = n, y= n - 10), 
            colour = "white", size = 4 , vjust = 2) +
  labs(x='', y='', 
       title = "Number of occurences of fraud by month")

# make a table to investigate which hours of the day are most affected 
card_fraud %>% 
  
# filter only for fraudulent transactions
  filter(is_fraud==1) %>% 
  
# count number of transactions of fraud by hour of the day 
  group_by(hour) %>% 
  count() %>%
# arrange from most fraud occurences to least 
  arrange(desc(n))
## # A tibble: 24 × 2
## # Groups:   hour [24]
##     hour     n
##    <int> <int>
##  1    23  1012
##  2    22   981
##  3     0   348
##  4     1   332
##  5     3   326
##  6     2   313
##  7    19    52
##  8    18    49
##  9    17    48
## 10    13    45
## # ℹ 14 more rows
  • Fraud seems to occur far more frequently during night time hours when people are asleep and more frequently on weekend days as these are generally the times during which people are paying less attention to their bank accounts. Furthermore, fraud is far more common in the first six months of the year than the last. It is unclear why.

  • Next, we turn our attention to the numerical variables: what is the distribution of amounts of fraudulent and legitimate transactions? :

# some quick summary stats using summarize 
# group by fraud versus legitimate 
card_fraud %>% 
  group_by(is_fraud) %>% 
# calculate summary stats of amounts for each
  summarize(mean_amount= mean(amt), 
            median_amount = median(amt), 
            minimum_amount = min(amt), 
            maximum_amount = max(amt))
## # A tibble: 2 × 5
##   is_fraud mean_amount median_amount minimum_amount maximum_amount
##   <fct>          <dbl>         <dbl>          <dbl>          <dbl>
## 1 1              527.          369.            1.06          1334.
## 2 0               67.6          47.2           1            27120.
# now plot histogram of fraudulent transaction amounts  
card_fraud %>%
# only fraudulent transactions
  filter(is_fraud==1) %>% 
# plot distribution - make bins look as good as possible 
  ggplot(aes(x = amt)) +
  geom_histogram(bins=50) +

  labs(title = "Distribution of transactions amounts
       \nof fraudulent transactions", 
       x = "Transaction amount / $", y="")

  • We see that the distribution of fraudulent transactions has a mean of about $500 and barely any amounts go above $1000. This makes sense as fraudsters would want transactions to be worth it, but without attracting too much attention. Many tiny transactions or one very large transaction would get the attention of the victim more quickly. Contrarily, legitimate transactions obviously have a much wider range and lower mean/median value.

  • Now let’s consider geospatial variables: select the 400 cities (roughly half of the number of cities represented) with the most legitimate/fraud transactions and then use leaflet library to help us plot an interactive map of the lat and long locations of where legitimate/fraud transactions took place:

# arrange cities in order of number of transactions
top_fraud_cities <- card_fraud %>%
  count(city) %>% 
  arrange(desc(n)) %>% 
  
# slice 400 most represented cities 
  slice(1:400) %>%
  select(city) %>%
  pull()
  
# create colour vector 
colours <- c('#e41a1c','#377eb8')

# create function that assigns colours to fraud/legit transactions 
point_colour <- colorFactor(palette = colours,  
                          card_fraud$is_fraud)


# filter only for top fraud cities 
card_fraud %>%
  filter(city %in% top_fraud_cities) %>% 
# use leaflet and openstreetmaps for interactive map 
  leaflet() %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 

# create circles at lat and long locations 
  addCircleMarkers(lng = ~long, 
                   lat = ~lat, 
                   radius = 1, 
                   
# colour fraud/legit transaction points 
                   color = ~point_colour(is_fraud), 
                   fillOpacity = 0.6, 
                   label = ~is_fraud) %>%
# map legend  
  addLegend("bottomright", pal = point_colour, 
            values = ~is_fraud,
            title = "Fraud")
  • Furthermore, we have calculated the distances between transaction locations and the home location of customers. Do these distances affect the likelihood of fraud occurring? :
# create new column to turn is_fraud into characters 
card_fraud %>% 
  mutate(is_fraud_new= as.character(is_fraud)) %>% 
  
# violin plot of distances 
  ggplot(aes(x = is_fraud_new, y=distance_km)) +
         geom_violin() + 
# label fraud categories correctly 
  scale_x_discrete(labels= c("Fraudulent","Legitimate")) +
  
  labs(x='', y = "Distance / km" , 
       title = "Is fraudulent activity affected by distance between \ncardholder's home and transaction location")

  • There seems to be absolutely no relationship between the distance between the cardholders’ home and the transaction location and whether fraudulent activity occurs. This is not a useful data factor to explain fraud.

  • Lastly we examine our categorical variables: category of merchant and job of victim:

  • Which category of merchant is most likely to experience fraudulent transactions? :

card_fraud %>% 
# filter for only fraud transactions
  filter(is_fraud==1) %>% 
  
# count them by category and rearrange bars based on count
  count(category) %>% 
  mutate(category = fct_reorder(category,n)) %>%
  
# calculate percentage of fraud by catgeory of merchant 
  mutate(percentage = n/sum(n)*100) %>% 
  
# plot categories of fraud percentage frequency
  ggplot(aes(y=category, x=percentage)) + 
  geom_bar(stat = "identity") +
  labs(title="Percentage of fraudulent transactions 
       \nby type of merchant", 
       x = "Percentage of transactions",
       y = "Type of merchant")

  • Are customers with certain jobs more likely to experience fraud? :
card_fraud %>% 
  
# filter for only fraud transactions
  filter(is_fraud==1) %>% 
# group by job 
  group_by(job) %>% 
  
# count fraud transactions by customer job 
  count() %>% 
  arrange(desc(n))
## # A tibble: 441 × 2
## # Groups:   job [441]
##    job                          n
##    <chr>                    <int>
##  1 Materials engineer          31
##  2 Audiological scientist      30
##  3 Mechanical engineer         30
##  4 Copywriter, advertising     29
##  5 Film/video editor           27
##  6 Exhibition designer         26
##  7 Surveyor, land/geomatics    26
##  8 Energy engineer             25
##  9 Financial trader            25
## 10 Naval architect             24
## # ℹ 431 more rows

There is clearly a large discrepancy between certain types of merchant category and customer job in this data set. Let us combine the smallest represented categories of these into “Other” and convert the string variables to factors. This is much more useful for machine learning purposes and will enhance the predictive performance of our model: we will do the category combining in our recipe to follow later:

# convert the two string variables to factors 
card_fraud <- card_fraud %>% 
  mutate(category = factor(category),
         job = factor(job))

1.3 Fit workflows on smaller data sample

This data set has 670K rows and trying various models may break things…

Thus, we will work with a smaller sample of 10% of the values the original data set to identify the best model, and once we have the best model we can use the full data set to train and test our best model.

# select a smaller subset
my_card_fraud <- card_fraud %>% 
  
# select a smaller subset, 10% of the entire dataframe 
  slice_sample(prop = 0.10) %>% 
  
# pick our variables of interest 
  select(category, amt, distance_miles, city_pop, hour, wday, month_name, age, is_fraud)

1.4 Split the data in training - testing

set.seed(123)

# split our smaller data set 80% training, 20% testing
# with equal proportions of fraud transactions in each set 
data_split <- initial_split(my_card_fraud, 
                           prop = 0.8, 
                           strata = is_fraud)

# split the data 
card_fraud_train <- training(data_split) 
card_fraud_test <- testing(data_split)

1.5 Cross Validation

We will use 3 folds for cross validation:

set.seed(123)

# 3-fold validation for training data 
# equal proportion of fraud transactions in each fold 
cv_folds <- vfold_cv(data = card_fraud_train, 
                          v = 3, 
                          strata = is_fraud)
cv_folds 
## #  3-fold cross-validation using stratification 
## # A tibble: 3 × 2
##   splits                id   
##   <list>                <chr>
## 1 <split [35787/17894]> Fold1
## 2 <split [35787/17894]> Fold2
## 3 <split [35788/17893]> Fold3

1.6 Define a recipe

We define our pre-processor recipes steps below:

# predict fraud transaction for training data with following recipe 
fraud_rec <- recipe(is_fraud ~ ., data = card_fraud_train) %>%
  
# Categories with less than 5% will be in a category 'Other'
    
  step_other(category, threshold = .05) %>% 

# log amounts data becuase this is very disperse 
  step_log(amt) %>% 

# skip all NA's 
  step_naomit(everything(), skip = TRUE) %>% 
    
# deal with variables not encountered in training data 
  step_novel(all_nominal(), -all_outcomes()) %>%
    
# Convert nominal data into numeric dummy variables
  step_dummy(all_nominal(), -all_outcomes()) %>%
    
# deal with numeric variables that contain only a single value
  step_zv(all_numeric(), -all_outcomes()) %>% 
  
# center and scale all numeric variables 
  step_normalize(all_numeric(), -all_outcomes())

Check the pre-processed data frame :

prepped_data <- 
  fraud_rec %>% # use the recipe object
  prep() %>% # perform the recipe on training data
  juice() # extract only the preprocessed dataframe 

glimpse(prepped_data)
## Rows: 53,681
## Columns: 36
## $ amt                     <dbl> 0.17148395, 0.88512260, 1.00878720, -1.6631655…
## $ distance_miles          <dbl> -1.3772146, -2.2131451, -1.0702907, 0.7966128,…
## $ city_pop                <dbl> -0.112042925, -0.295246584, -0.181217511, 1.23…
## $ hour                    <dbl> 1.0601295, 1.5001222, 1.3534580, -1.4331628, 1…
## $ age                     <dbl> 1.4532901, 1.5923248, 0.1474289, 0.7426346, -0…
## $ is_fraud                <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ category_food_dining    <dbl> -0.2729857, -0.2729857, -0.2729857, -0.2729857…
## $ category_gas_transport  <dbl> -0.3354959, -0.3354959, -0.3354959, -0.3354959…
## $ category_grocery_pos    <dbl> -0.3250891, -0.3250891, -0.3250891, -0.3250891…
## $ category_health_fitness <dbl> -0.2651429, -0.2651429, -0.2651429, -0.2651429…
## $ category_home           <dbl> -0.3248788, 3.0780132, -0.3248788, -0.3248788,…
## $ category_kids_pets      <dbl> 3.2136121, -0.3111705, -0.3111705, -0.3111705,…
## $ category_misc_pos       <dbl> -0.2558877, -0.2558877, -0.2558877, 3.9078915,…
## $ category_personal_care  <dbl> -0.2714067, -0.2714067, -0.2714067, -0.2714067…
## $ category_shopping_net   <dbl> -0.2868772, -0.2868772, -0.2868772, -0.2868772…
## $ category_shopping_pos   <dbl> -0.3178938, -0.3178938, 3.1456457, -0.3178938,…
## $ category_other          <dbl> -0.360463, -0.360463, -0.360463, -0.360463, -0…
## $ wday_1                  <dbl> -0.8051577, 1.0672275, 1.5353238, -1.2732540, …
## $ wday_2                  <dbl> 0.2587946, -0.7015175, 0.2587946, 1.6992628, 0…
## $ wday_3                  <dbl> 1.1141244, -1.0535402, -0.6922628, -1.0535402,…
## $ wday_4                  <dbl> -1.1949572, -0.1064591, -1.1949572, 0.9820390,…
## $ wday_5                  <dbl> 1.3152828, 0.9762827, -1.2837179, -0.3797177, …
## $ wday_6                  <dbl> -0.7814827, 1.6343804, -0.7814827, 0.2538872, …
## $ wday_7                  <dbl> 0.38113840, 1.09496460, -0.33268780, -0.026762…
## $ month_name_01           <dbl> 1.12717537, -1.21275245, -0.33527952, -0.04278…
## $ month_name_02           <dbl> -0.3088140, 1.1369694, -0.7606213, -1.0317056,…
## $ month_name_03           <dbl> -1.40430930, -0.07938478, 1.07992418, 0.583077…
## $ month_name_04           <dbl> -0.6379908, -0.8128107, 0.3089502, 1.0810713, …
## $ month_name_05           <dbl> 0.7250231, 1.7367758, -0.9765610, -0.7006284, …
## $ month_name_06           <dbl> 1.3535100, -1.5380702, 0.7338856, -0.5053630, …
## $ month_name_07           <dbl> 0.9437507, 1.4623115, 0.8310201, 1.2255772, 1.…
## $ month_name_08           <dbl> -0.2951161, -0.9389022, -1.2250294, 0.2234894,…
## $ month_name_09           <dbl> -1.277372826, 0.653770738, 0.346543353, -1.233…
## $ month_name_10           <dbl> -1.6674716175, -0.3153962357, 1.1684492046, 0.…
## $ month_name_11           <dbl> -1.097250658, 0.112030026, -1.651504305, 1.341…
## $ month_name_12           <dbl> -0.42005293, 0.02638127, 1.11456464, -1.647747…

1.7 Define various models

We define the following classification models:

  1. Logistic regression, using the glm engine
  2. Decision tree, using the C5.0 engine
  3. Random Forest, using the ranger engine and setting importance = "impurity")
  4. A boosted tree using Extreme Gradient Boosting, and the xgboost engine
  5. A k-nearest neighbours, using 4 nearest_neighbors and the kknn engine
# 1. Pick a `model type`
# 2. set the `engine`
# 3. Set the `mode`: classification

# Logistic regression
log_spec <-  logistic_reg() %>%  # model type
  set_engine(engine = "glm") %>%  # model engine
  set_mode("classification") # model mode

# Show  model specification
log_spec
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
# Decision Tree
tree_spec <- decision_tree() %>%
  set_engine(engine = "C5.0") %>%
  set_mode("classification")

tree_spec
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
# Random Forest
library(ranger)

rf_spec <- 
  rand_forest() %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")


# Boosted tree (XGBoost)
library(xgboost)

xgb_spec <- 
  boost_tree() %>% 
  set_engine("xgboost") %>% 
  set_mode("classification") 

# K-nearest neighbour (k-NN)
knn_spec <- 
  nearest_neighbor(neighbors = 4) %>% 
  set_engine("kknn") %>% 
  set_mode("classification") 

1.8 Bundle recipe and model with workflows

# Bundle recipe and model into workflows 

# Logistic regression
log_wflow <- # new workflow object
 workflow() %>% # use workflow function
 add_recipe(fraud_rec) %>%   # add recipe
 add_model(log_spec)   # add model spec

# show object
log_wflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 7 Recipe Steps
## 
## • step_other()
## • step_log()
## • step_naomit()
## • step_novel()
## • step_dummy()
## • step_zv()
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
# Decision Tree
tree_wflow <-
 workflow() %>%
 add_recipe(fraud_rec) %>% 
 add_model(tree_spec) 


# Random Forest
rf_wflow <-
 workflow() %>%
 add_recipe(fraud_rec) %>% 
 add_model(rf_spec) 


# Boosted tree (XGBoost)
xgb_wflow <-
 workflow() %>%
 add_recipe(fraud_rec) %>% 
 add_model(xgb_spec)


# K-nearest neighbour (k-NN)
knn_wflow <-
 workflow() %>%
 add_recipe(fraud_rec) %>% 
 add_model(knn_spec)

1.9 Fit models

# fit all 5 models and use tic to compare the times they take to run 
# resample using our 3 folds 
# collect model performance metrics 

# Logistic regression
tic()
log_res <- log_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, accuracy,
      kap, roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)) 
time_log <- toc()
## 4.982 sec elapsed
log_time <- time_log[[4]]


# Decision Tree
tic()
tree_res <- tree_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, accuracy,
      kap, roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)) 
time_tree <- toc()
## 18.987 sec elapsed
tree_time <- time_tree[[4]]


# Random Forest
tic()
rf_res <- rf_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, accuracy,
      kap, roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)) 
time_rf <- toc()
## 43.713 sec elapsed
rf_time <- time_rf[[4]]


# Boosted tree (XGBoost)
tic()
xgb_res <- xgb_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, accuracy,
      kap, roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)) 
time_xgb <- toc()
## 3.797 sec elapsed
xgb_time <- time_xgb[[4]]


# K-nearest neighbour (k-NN)
tic()
knn_res <- knn_wflow %>% 
  fit_resamples(
    resamples = cv_folds, 
    metrics = metric_set(
      recall, precision, f_meas, accuracy,
      kap, roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)) 
time_knn <- toc()
## 184.174 sec elapsed
knn_time <- time_knn[[4]]

1.10 Compare models

# Model Comparison

log_metrics <- 
  log_res %>% 
  collect_metrics(summarise = TRUE) %>%
  # add the name of the model to every row
  mutate(model = "Logistic Regression",
         time = log_time)

tree_metrics <- 
  tree_res %>% 
  collect_metrics(summarise = TRUE) %>%
  mutate(model = "Decision Tree",
         time = tree_time)

rf_metrics <- 
  rf_res %>% 
  collect_metrics(summarise = TRUE) %>%
  mutate(model = "Random Forest",
         time = rf_time)

xgb_metrics <- 
  xgb_res %>% 
  collect_metrics(summarise = TRUE) %>%
  mutate(model = "XGBoost",
         time = xgb_time)

knn_metrics <- 
  knn_res %>% 
  collect_metrics(summarise = TRUE) %>%
  mutate(model = "Knn",
         time = knn_time)


# create dataframe with all models
model_compare <- bind_rows(log_metrics,
                            tree_metrics,
                            rf_metrics,
                           xgb_metrics,
                           knn_metrics
                      ) %>% 
# get rid of 'sec elapsed' and turn it into a number
  mutate(time = str_sub(time, end = -13) %>% 
           as.double()
         )

# pivot wider to create barplot
  model_comp <- model_compare %>% 
  select(model, .metric, mean, std_err) %>% 
  pivot_wider(names_from = .metric, values_from = c(mean, std_err)) 

# show mean area under the curve (ROC-AUC) for every model
model_comp %>% 
  arrange(mean_roc_auc) %>% 
  
# order results with highest area bar on top 
  mutate(model = fct_reorder(model, mean_roc_auc)) %>% 
  
# column chart with model name on y-axis, mean are under ROC curve on x-axis
  ggplot(aes(model, mean_roc_auc, fill=model)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Blues") +
   geom_text(
     size = 3,
     aes(label = round(mean_roc_auc, 2), 
         y = mean_roc_auc + 0.08),
     vjust = 1
  )+
  theme_light()+
  theme(legend.position = "none")+
  labs(y = "Mean area under ROC curve", 
       title ="Which model performs best?")

From the graph above, the Random Forest and XGBoost models seem to be far superior to the others in terms of sensitivity and specificity i.e. they predict far fewer false positive/negatives.

1.11 Last_fit metrics

Since the Random Forest and XGBoost models seem to have outperformed the others, we will use these workflows to do a last_fit test:

# last fit on rf workflow using data_split 
last_fit_rf <- last_fit(rf_wflow, 
                        split = data_split,
                        
# summarize important metrics 
                        metrics = metric_set(
                          accuracy, f_meas, kap, precision,
                          recall, roc_auc, sens, spec))

last_fit_rf %>% collect_metrics(summarize = TRUE)
## # A tibble: 8 × 4
##   .metric   .estimator .estimate .config             
##   <chr>     <chr>          <dbl> <chr>               
## 1 accuracy  binary         0.997 Preprocessor1_Model1
## 2 f_meas    binary         0.587 Preprocessor1_Model1
## 3 kap       binary         0.586 Preprocessor1_Model1
## 4 precision binary         1     Preprocessor1_Model1
## 5 recall    binary         0.416 Preprocessor1_Model1
## 6 sens      binary         0.416 Preprocessor1_Model1
## 7 spec      binary         1     Preprocessor1_Model1
## 8 roc_auc   binary         0.988 Preprocessor1_Model1
# Compare to training model
rf_res %>% collect_metrics(summarize = TRUE)
## # A tibble: 8 × 6
##   .metric   .estimator  mean     n   std_err .config             
##   <chr>     <chr>      <dbl> <int>     <dbl> <chr>               
## 1 accuracy  binary     0.996     3 0.000113  Preprocessor1_Model1
## 2 f_meas    binary     0.361     3 0.0340    Preprocessor1_Model1
## 3 kap       binary     0.360     3 0.0338    Preprocessor1_Model1
## 4 precision binary     0.980     3 0.0196    Preprocessor1_Model1
## 5 recall    binary     0.222     3 0.0244    Preprocessor1_Model1
## 6 roc_auc   binary     0.970     3 0.00399   Preprocessor1_Model1
## 7 sens      binary     0.222     3 0.0244    Preprocessor1_Model1
## 8 spec      binary     1.00      3 0.0000187 Preprocessor1_Model1
# last fit on xgb workflow using data_split 
last_fit_xgb <- last_fit(xgb_wflow, 
                        split = data_split,
                        
# summarize important metrics 
                        metrics = metric_set(
                          accuracy, f_meas, kap, precision,
                          recall, roc_auc, sens, spec))

last_fit_xgb %>% collect_metrics(summarize = TRUE)
## # A tibble: 8 × 4
##   .metric   .estimator .estimate .config             
##   <chr>     <chr>          <dbl> <chr>               
## 1 accuracy  binary         0.998 Preprocessor1_Model1
## 2 f_meas    binary         0.770 Preprocessor1_Model1
## 3 kap       binary         0.769 Preprocessor1_Model1
## 4 precision binary         0.897 Preprocessor1_Model1
## 5 recall    binary         0.675 Preprocessor1_Model1
## 6 sens      binary         0.675 Preprocessor1_Model1
## 7 spec      binary         1.00  Preprocessor1_Model1
## 8 roc_auc   binary         0.954 Preprocessor1_Model1
# Compare to training model
xgb_res %>% collect_metrics(summarize = TRUE)
## # A tibble: 8 × 6
##   .metric   .estimator  mean     n  std_err .config             
##   <chr>     <chr>      <dbl> <int>    <dbl> <chr>               
## 1 accuracy  binary     0.997     3 0.000229 Preprocessor1_Model1
## 2 f_meas    binary     0.726     3 0.0133   Preprocessor1_Model1
## 3 kap       binary     0.725     3 0.0134   Preprocessor1_Model1
## 4 precision binary     0.886     3 0.0253   Preprocessor1_Model1
## 5 recall    binary     0.618     3 0.0281   Preprocessor1_Model1
## 6 roc_auc   binary     0.974     3 0.000932 Preprocessor1_Model1
## 7 sens      binary     0.618     3 0.0281   Preprocessor1_Model1
## 8 spec      binary     1.00      3 0.000135 Preprocessor1_Model1

From the tables, over fitting has clearly not taken place as the model fits the training and testing data equally well. This is excellent and also more true for the XGBoost model. We will choose this as the “best” model and move forward in analysis with only XGBoost.

1.12 Get variable importance using vip package

# use vip package 
# extract the fit from the xgb workflow 
last_fit_xgb %>% 
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>% 
  
# top 10 most important variables 
  vip(num_features = 10) +
  theme_light() + 
  labs(title = "Variable importance for XGBoost model fit")

It would appear that transaction amount was by far the most important variable in predicting whether a transaction was fraudulent. We discovered this might be the case in our earlier EDA as legitimate transactions had a far greater spread and lower mean than fraudulent transactions.

1.13 Plot Final Confusion matrix and ROC curve

# Final Confusion Matrix for XGBoost model 

last_fit_xgb %>%
# collect preditions and plot confusion matrix 
  collect_predictions() %>% 
  conf_mat(is_fraud, .pred_class) %>% 
  autoplot(type = "heatmap") +
  
# fix axis labels and title 
  scale_x_discrete(labels= c("Fraudulent","Legitimate")) +
  scale_y_discrete(labels= c("Legitimate","Fraudulent")) +
  labs(title="Confusion matrix for XGBoost model")

# Final ROC curve for XGBoost model
last_fit_xgb %>% 
  collect_predictions() %>% 
  roc_curve(is_fraud, .pred_1) %>% 
  autoplot() +
  labs(title="ROC curve for XGBoost model")

An excellent confusion matrix and a near perfect ROC curve. This is a very good model!

1.14 Calculating the cost of fraud to the company

  • How much money (in US$ terms) are fraudulent transactions costing the company? Generate a table that summarizes the total amount of legitimate and fraudulent transactions per year and calculate the % of fraudulent transactions, in US$ terms.
# use xgb workflow to get predictions
xgb_preds <- 
  xgb_wflow %>% 
  fit(data = card_fraud_train) %>%  
  
# Use `augment()` to get predictions for entire data set
  augment(new_data = card_fraud)

# confusion matrix
xgb_preds %>% 
  conf_mat(truth = is_fraud, estimate = .pred_class)
##           Truth
## Prediction      1      0
##          1   2766    219
##          0   1170 666873
# select only transaction amount and fraud predictions 
cost <- xgb_preds %>%
  select(is_fraud,year, amt, pred = .pred_class) 

cost <- cost %>%
  
  # naive false-- we think every single transaction is ok and not fraud
  mutate(false_naive = ifelse(is_fraud == 1, amt, 0)) %>% 

  # false negatives-- we thought they were not fraud, but they were
  mutate(false_negatives = ifelse(pred == 0 & is_fraud == 1, amt, 0)) %>% 

  # false positives-- we thought they were fraud, but they were not
  mutate(false_positives = ifelse(pred == 1 & is_fraud == 0, amt, 0)) %>% 

  # true positives-- we thought they were fraud, and they were 
  mutate(true_positives = ifelse(pred == 1 & is_fraud == 1, amt, 0)) %>% 

  # true negatives-- we thought they were ok, and they were 
  mutate(true_negatives = ifelse(pred == 0 & is_fraud == 0, amt, 0))

  
# Summarising

cost_summary <- cost %>% 
  summarise(across(starts_with(c("false","true")), 
            ~ sum(.x, na.rm = TRUE)))

cost_summary
## # A tibble: 1 × 5
##   false_naive false_negatives false_positives true_positives true_negatives
##         <dbl>           <dbl>           <dbl>          <dbl>          <dbl>
## 1    2075089.         362761.         161391.        1712328      44947425.
# how much money are fraudulent transactions costing the company?
# group transactions by year and fraud status
cost %>% 
  group_by(year, is_fraud) %>%
  
# calculate total dollar amount for each fraud group and year 
  summarize(total_amount_dollars = sum(amt)) %>% 
  
# calculate percentage of total amount for each fraud group 
  mutate(percentage_dollars = 
           total_amount_dollars/sum(total_amount_dollars)*100)  
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   year [2]
##    year is_fraud total_amount_dollars percentage_dollars
##   <dbl> <fct>                   <dbl>              <dbl>
## 1  2019 1                    1423140.               4.23
## 2  2019 0                   32182901.              95.8 
## 3  2020 1                     651949.               4.80
## 4  2020 0                   12925914.              95.2
  • Compare your model vs the naive classification that we do not have any fraudulent transactions. The $ improvement of our model over the naive policy equals cost_summary$false_naive - cost_summary$false_negatives - cost_summary$false_positives * 0.02)

    = $1670262
cost_summary %>% 
  summarise(improvment_dollars = false_naive - false_negatives - 0.02*false_positives)
## # A tibble: 1 × 1
##   improvment_dollars
##                <dbl>
## 1           1709100.